How does network structure affect diffusion? Recent studies suggest that the answer depends on the type of
contagion. Complex contagions, unlike infectious diseases (simple contagions), are affected by social 
reinforcement and homophily. Hence, the spread within highly clustered communities is enhanced, while 
diffusion across communities is hampered. A common hypothesis is that memes and behaviors are complex 
contagions. We show that, while most memes indeed spread like complex contagions, a few viral memes 
spread across many communities, like diseases. We demonstrate that the future popularity of a meme can be 
predicted by quantifying its early spreading pattern in terms of community concentration. The more 
communities a meme permeates, the more viral it is. We present a practical method to translate data about 
community structure into predictive knowledge about what information will spread widely. This connection 
contributes to our understanding in computational social science, social media analytics, and marketing 
applications. 



Diseases, ideas, innovations, and behaviors spread through social networks 1-12 . With the availability of
large-scale, digitized data on social communication 13,14 , the study of the diffusion of memes (units of transmissible
information) has recently become feasible 15-18 . The questions of how memes spread and which will go viral
have attracted much attention across disciplines, including marketing 6,19 , network science 20,21 ,
communication 22 , and social media analytics 23-25 . Network structure can greatly affect the spreading process 15,26,27 ; for
example, infections with a small spreading rate persist in scale-free networks 8 . Existing research has attempted to
characterize viral memes in terms of message content 22 , temporal variation 16,24 , influential users 19,28 , finite user
attention 18,21 , and local neighborhood structure 10 . Yet what determines the success of a meme, and how a meme
interacts with the underlying network structure, is still elusive. A simple, popular approach to studying meme
diffusion is to treat memes as diseases and apply epidemic models 3,4 . However, recent studies demonstrate
that diseases and behaviors spread differently; they have therefore been referred to as simple and complex
contagions, respectively 9,29 .

Here we propose that network communities 30-32 (strongly clustered groups of people) provide a unique
vantage point on the challenge of predicting viral memes. We show that (i) communities allow us to estimate
how much the spreading pattern of a meme deviates from that of infectious diseases; (ii) viral memes tend to 
spread like epidemics; and finally (iii) we can predict the virality of memes based on early spreading patterns in 
terms of community structure. We employ the popularity of a meme as an indicator of its virality; viral memes 
appear in a large number of messages and are adopted by many people. 

Community structure has been shown to affect information diffusion, including global cascades 33,34 , the speed 
of propagation 35 , and the activity of individuals 36,37 . One straightforward effect is that communities are thought
to be able to cripple the global spread because they act as traps for random flows 35,36 (Fig. 1(A)). Yet the causes and
consequences of the trapping effect are not fully understood, particularly when structural trapping is
combined with two important phenomena: social reinforcement and homophily. Complex contagions are sensitive
to social reinforcement: each additional exposure significantly increases the chance of adoption. Although the
notion is not new 38 , it was only recently confirmed in a controlled experiment 9 . A few concentrated adoptions 
inside highly clustered communities can induce many multiple exposures (Fig. 1(B)). The adoption of memes 
within communities may also be affected by homophily, according to which social relationships are more likely to 
form between similar people 39,40 . Communities capture homophily as people sharing similar characteristics 
naturally establish more edges among them. Thus we expect similar tastes among community members, making 
people more susceptible to memes from peers in the same community (Fig. 1(C)). Straightforward examples of 
homophilous communities are those formed around language or culture (Fig. 1(D,E)); people are much more 
likely to propagate messages written in their mother tongue. Separating social contagion and homophily is
difficult 41,42 , and we interpret complex contagion broadly to include homophily; we focus on how social
reinforcement and homophily collectively boost the trapping of memes within dense communities, not on the
distinctions between them.

Figure 1 | The importance of community structure in the spreading of social contagions. (A) Structural trapping: dense communities with few outgoing
links naturally trap information flow. (B) Social reinforcement: people who have adopted a meme (black nodes) trigger multiple exposures to others (red
nodes). In the presence of high clustering, any additional adoption is likely to produce more multiple exposures than in the case of low clustering,
inducing cascades of additional adoptions. (C) Homophily: people in the same community (same color nodes) are more likely to be similar and to adopt
the same ideas. (D) Diffusion structure based on retweets among Twitter users sharing the hashtag #USA. Blue nodes represent English users and red
nodes are Arabic users. Node size and link weight are proportional to retweet activity. (E) Community structure among Twitter users sharing the hashtags
#BBC and #FoxNews. Blue nodes represent #BBC users, red nodes are #FoxNews users, and users who have used both hashtags are green. Node size is
proportional to usage (tweet) activity; links represent mutual following relations.

To examine and quantify the spreading patterns of memes, we 
analyze a dataset collected from Twitter, a micro-blogging platform
that allows millions of people to broadcast short messages ('tweets'). 
People can 'follow' others to receive their messages, forward 
('retweet' or "RT" in short) tweets to their own followers, or mention 
('@' in short) others in tweets. People often label tweets with topical 
keywords ('hashtags'). We consider each hashtag as a meme. 

Results 

Communities and communication volume. Do memes spread like 
complex contagions in general? If social reinforcement and 
homophily significantly influence the spread of memes, we expect 
more communication within than across communities. Let us define 
the weight w of an edge by the frequency of communication between 
the users connected by the edge. Nodes are partitioned into dense 
communities based on the structure of the network, but without 
knowledge of the weights (see Methods). For each community c, 
the average edge weights of intra- and inter-community links,
$\langle w_{\mathrm{intra}} \rangle_c$ and $\langle w_{\mathrm{inter}} \rangle_c$, quantify how much information flows within
and across communities, respectively. We measure weights by
aggregating all the meme spreading events in our data. If memes
spread obliviously to community structure, like simple contagions,
we would expect no difference between intra- and inter-community
links. By contrast, we observe that the intra-community links carry
more messages (Fig. 2(A)). Similar results have been reported from
other datasets 35,37 . In addition, by defining the focus of an individual
as the fraction of activity that is directed to each neighbor in the same
community, $f_{\mathrm{intra}}$, or in different communities, $f_{\mathrm{inter}}$, we find that people
interact more with members of the same community (Fig. 2(B)). All
the results are statistically significant ($p \ll 0.001$) and robust across
community detection methods (see Supplementary Information for
additional details).
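
The comparison of intra- and inter-community edge weights can be illustrated with a short Python sketch. It is not the code used in this study; the function name, the dictionary-based inputs, and the choice to count an inter-community edge toward both communities it touches are assumptions made for illustration.

```python
from collections import defaultdict


def community_edge_weights(edges, membership):
    """Average intra- and inter-community edge weight for each community.

    edges: dict mapping an undirected edge (u, v) to its weight, i.e. the
           communication frequency between u and v (hypothetical input format).
    membership: dict mapping each node to its community id.
    Returns {community: (mean_intra, mean_inter)}; a mean is None when the
    community has no edge of that kind.
    """
    intra = defaultdict(list)
    inter = defaultdict(list)
    for (u, v), w in edges.items():
        cu, cv = membership[u], membership[v]
        if cu == cv:
            intra[cu].append(w)
        else:
            # Assumption: an edge across communities counts for both of them.
            inter[cu].append(w)
            inter[cv].append(w)

    def mean(values):
        return sum(values) / len(values) if values else None

    return {c: (mean(intra[c]), mean(inter[c])) for c in set(membership.values())}


# Toy example: two communities joined by a single weak link.
membership = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
edges = {("a", "b"): 4, ("b", "c"): 2, ("c", "d"): 1, ("d", "e"): 3}
weights = community_edge_weights(edges, membership)
```

In the toy example both communities carry heavier links inside than across, which is the pattern reported in Fig. 2(A).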

Figure 2 | Meme concentration in communities. We measure weights
and focus in terms of retweets (RT) or mentions (@). We show (A)
community edge weight and (B) user community focus using box plots. Boxes
cover 50% of the data and whiskers cover 95%. The line and triangle in a box
represent the median and mean, respectively.

Meme concentration in communities. These results suggest that
communities strongly trap communication. To quantify this effect
for individual memes, let us define the concentration of a meme in
communities. We expect more concentrated communication and
meme adoption within communities if the meme spreads like a
complex contagion. To gauge this effect, we introduce four
baseline models. The random sampling model ($M_1$) assumes equal
adoption probability for everyone, ignoring network topology and all
activity. The simple cascade model ($M_2$) simulates the spreading of
simple contagions 43 . The social reinforcement model ($M_3$) employs a
simple social reinforcement mechanism in addition to considering
the network structure. In the homophily model ($M_4$), users prefer to
adopt the same ideas that are adopted by others in the same
community. The simulation mechanisms of the four baseline
models are summarized in Table 1.
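
To make the simple cascade baseline concrete, the following Python sketch follows the description of the model in Table 1: with probability p = 0.85 a randomly chosen infected node passes the meme to one of its neighbors, and otherwise the process restarts from a new random seed. The function name and the adjacency-list input are hypothetical, and two details that the description leaves open are assumptions here: each step emits one tweet event, and a user may appear in more than one event.

```python
import random


def simple_cascade(neighbors, n_events, p=0.85, rng=None):
    """Sketch of a simple cascade: returns one user id per simulated tweet.

    neighbors: dict mapping each node to a list of its neighbors.
    n_events:  number of tweet events to generate.
    p:         probability of spreading from an infected node; with
               probability 1 - p the process restarts from a random seed.
    Note: if every infected node is isolated and p == 1, no step can succeed.
    """
    rng = rng or random.Random()
    nodes = list(neighbors)
    seed = rng.choice(nodes)
    infected = [seed]        # distinct adopters, in order of adoption
    infected_set = {seed}
    events = [seed]          # one entry per simulated tweet
    while len(events) < n_events:
        if rng.random() < p:
            source = rng.choice(infected)
            if not neighbors[source]:
                continue     # isolated source: nothing to spread to
            user = rng.choice(neighbors[source])
        else:
            user = rng.choice(nodes)   # restart from a new seed user
        events.append(user)
        if user not in infected_set:
            infected_set.add(user)
            infected.append(user)
    return events


# Toy usage on a five-node path graph.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
events = simple_cascade(path, n_events=30, rng=random.Random(1))
```

Under these assumptions, the reinforcement and homophily variants would differ only in how the adopting user is chosen at each spreading step.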

We estimate the trapping effects on memes by comparing the
empirical data with these models. Note that we only focus on new
memes (see definition in Methods). Let us define the concentration
of a meme $h$ based on the proportions of tweets in each community.
The usage-dominant community $c^t(h)$ is the community generating
most tweets with $h$. The usage dominance of $h$, $r(h)$, is the proportion
of tweets produced in the dominant community $c^t(h)$ out of the total
number of tweets $T(h)$ containing the meme. We also compute the
usage entropy $H^t(h)$ based on how tweets containing $h$ are distributed
across different communities. The relative usage dominance
$r(h)/r_{M_1}(h)$ and entropy $H^t(h)/H^t_{M_1}(h)$ are calculated using $M_1$
as baseline. Analogous concentration measures can be defined based
on users. Let $g(h)$ be the adoption dominance of $h$, i.e., the proportion
of the $U(h)$ adopters in the community with most adopters. The
adoption entropy $H^u(h)$ is computed based on how adopters of $h$
are allocated across communities. The higher the dominance or
the lower the entropy, the stronger the concentration of the meme.
All measures are computed only based on tweets containing each
meme in its early stage (first 50 tweets) to avoid any bias from the
meme's popularity.
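
The dominance and entropy measures defined above can be computed from a list of community labels, one per tweet (usage) or one per distinct adopter (adoption). The Python sketch below is illustrative only; the function name is hypothetical, and the use of Shannon entropy with base-2 logarithms is an assumption, since the text does not fix the base. The relative measures are unaffected by this choice because the base cancels in the ratio to the baseline.

```python
import math
from collections import Counter


def dominance_and_entropy(community_labels):
    """Concentration of a meme across communities.

    community_labels: one community id per tweet (usage measures) or per
                      distinct adopter (adoption measures); must be non-empty.
    Returns (dominance, entropy): the share of items in the largest community
    and the Shannon entropy (base 2, an assumption) of the distribution.
    """
    counts = Counter(community_labels)
    total = sum(counts.values())
    dominance = max(counts.values()) / total
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return dominance, entropy


# Toy example: communities of the first tweets of a meme.
# In the setting described above, only the first 50 tweets would be used.
tweet_communities = ["a", "a", "a", "b"]
r, h_usage = dominance_and_entropy(tweet_communities[:50])
```

Dividing each value by the corresponding value obtained from random sampling gives the relative measures.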

Figures 3(A-D) demonstrate that non-viral memes exhibit concentration
similar to (or stronger than) baselines $M_3$ or $M_4$, suggesting
that these memes tend to spread like complex contagions. Note
that models $M_2$, $M_3$, and $M_4$ produce stronger concentration than
random sampling ($M_1$), because $M_2$ incorporates the structural trapping
effect in simple cascades, $M_3$ considers both structural trapping
and social reinforcement, and $M_4$ captures both structural trapping
and homophily.

Do all memes spread like complex contagions? While the majority 
of memes are not viral, viral memes are adopted differently. Their 
concentration in the empirical data is the same as that of the simple 
cascade model $M_2$ (see the gray areas in Fig. 3(A-D)); community
structure does not seem to trap successful memes as much as others. 
These memes spread like simple contagions, permeating through 
many communities. 



Strength of social reinforcement. To further distinguish viral
memes from others in terms of types of contagion, let us explicitly
estimate the strength of social reinforcement. For a given meme $h$, we
count the number of exposures that each adopter has experienced
before the adoption and compute the average exposures across all
adopters, representing the strength of social reinforcement on $h$,
labelled as $N(h)$. The exposures can be measured in terms of tweets,
$N^t(h)$, or users, $N^u(h)$. We compute relative average exposures,
$N(h)/N_{M_1}(h)$, using only tweets at the early stages (first 50 tweets).
If this quantity is large, adoptions are more likely to happen after
multiple reinforcing exposures and thus the meme spreads like a
complex contagion. As shown in Fig. 3(E-F), viral memes require
as little reinforcement as the simple cascade model $M_2$, while non-viral
memes need as many exposures as $M_3$ or $M_4$. We arrive at the
same conclusion: viral memes spread like simple contagions rather
than like complex ones.
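
The exposure count can be sketched in Python as follows. This is an illustration, not the procedure used for Fig. 3(E-F): the function name and inputs are hypothetical, an exposure is assumed to be a tweet containing the meme that was posted by a network neighbor before the user's first use, and users with no prior exposure are assumed to count as adopters with zero exposures.

```python
def average_exposures(tweet_users, neighbors):
    """Average exposures experienced by adopters before their first use.

    tweet_users: chronological list with the author of each tweet that
                 contains the meme.
    neighbors:   dict mapping each user to an iterable of neighbors who
                 see that user's tweets.
    Returns (avg_tweet_exposures, avg_user_exposures).
    """
    tweets_seen = {}   # user -> number of meme tweets received so far
    users_seen = {}    # user -> set of distinct neighbors who tweeted the meme
    adopted = set()
    per_adopter_tweets, per_adopter_users = [], []
    for user in tweet_users:
        if user not in adopted:
            # First use: record the exposures accumulated before adoption.
            adopted.add(user)
            per_adopter_tweets.append(tweets_seen.get(user, 0))
            per_adopter_users.append(len(users_seen.get(user, ())))
        for nb in neighbors.get(user, ()):
            tweets_seen[nb] = tweets_seen.get(nb, 0) + 1
            users_seen.setdefault(nb, set()).add(user)
    if not adopted:
        return 0.0, 0.0
    n = len(adopted)
    return sum(per_adopter_tweets) / n, sum(per_adopter_users) / n


# Toy example: a triangle in which user a tweets twice before b and c adopt.
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
n_t, n_u = average_exposures(["a", "a", "b", "c"], triangle)
```

In the toy example the three adopters saw 0, 2, and 3 tweets from 0, 1, and 2 distinct neighbors, respectively, before adopting.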

Prediction. The above findings imply an intriguing possibility: high
concentration of a meme would hint that the meme is only interesting
to certain communities, while weak concentration would imply a
universal appeal and therefore might be used to predict the virality of
the meme. To illustrate this intuition about the predictive power of
the community structure, we show in Fig. 4 how the diffusion pattern
of a viral meme differs from that of a non-viral one, when analyzed
through the lens of community concentration.

Let us therefore apply a machine learning technique, the random 
forests classification algorithm, to predict meme virality based on 
community concentration in the early diffusion stage. We employ 
two basic statistics based on early popularity and three types of 
community-based features in the prediction model, listed below. 

1. Basic features based on early popularity. Two basic statistical
features are included in the prediction model. The number of
early adopters is the number of distinct users who generated the
earliest tweets. The number of uninfected neighbors of early
adopters characterizes the set of users who can adopt the meme
during the next step.

2. Infected communities. The simplest feature related to communities
is the number of infected communities, i.e., the number
of communities containing early adopters.

3. Usage and adoption entropy. $H^t(h)$ and $H^u(h)$ are good indicators
of the strength of meme concentration, as shown in
Fig. 3.

4. Fraction of intra-community user interactions. We count
pair-wise user interactions about any given meme, and calculate
the proportion that occur between people in the same
community.
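
The features listed above could be assembled for one meme roughly as in the following Python sketch. The function and key names are hypothetical. The representation of interactions as (source, target) pairs taken from early retweets or mentions, and the base-2 entropy, are assumptions that the text does not specify.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2, an assumption) of a collection of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def early_features(tweet_users, interactions, neighbors, membership, n_early=50):
    """Prediction features from the first n_early tweets of one meme.

    tweet_users:  chronological list of tweet authors for the meme.
    interactions: list of (source, target) user pairs about the meme.
    neighbors:    dict user -> iterable of neighbors in the social network.
    membership:   dict user -> community id.
    """
    early = tweet_users[:n_early]
    adopters = set(early)
    # Neighbors of early adopters who have not used the meme yet.
    frontier = set()
    for user in adopters:
        frontier.update(neighbors.get(user, ()))
    frontier -= adopters
    intra = sum(1 for a, b in interactions if membership[a] == membership[b])
    return {
        "early_adopters": len(adopters),
        "uninfected_neighbors": len(frontier),
        "infected_communities": len({membership[u] for u in adopters}),
        "usage_entropy": entropy(membership[u] for u in early),
        "adoption_entropy": entropy(membership[u] for u in adopters),
        "intra_fraction": intra / len(interactions) if interactions else 0.0,
    }


# Toy example with two communities.
membership = {"a": 0, "b": 0, "c": 1, "d": 1}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
features = early_features(["a", "a", "b"], [("b", "a"), ("c", "a")],
                          neighbors, membership)
```

Each meme then yields one feature vector that can be passed to a classifier.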



Table 1 | Baseline models for information diffusion. The Network, Reinforcement, and Homophily columns mark the community effects captured by each model.

| Model | Network | Reinforcement | Homophily | Simulation implementation |
|---|---|---|---|---|
| $M_1$ | | | | For a given hashtag $h$, $M_1$ randomly samples the same number of tweets or users as in the real data. |
| $M_2$ | ✓ | | | $M_2$ takes the network structure into account while neglecting social reinforcement and homophily. $M_2$ starts with a random seed user. At each step, with probability $p$, an infected node is randomly selected and one of its neighbors adopts the meme, or with probability $1 - p$, the process restarts from a new seed user ($p = 0.85$). |
| $M_3$ | ✓ | ✓ | | The cascade in $M_3$ is generated similarly to $M_2$ but at each step the user with the maximum number of infected neighbors adopts the meme. |
| $M_4$ | ✓ | | ✓ | In $M_4$, the simple cascading process is simulated in the same way as in $M_2$ but subject to the constraint that at each step, only neighbors in the same community have a chance to adopt the meme. |

Figure 3 | Meme concentration in communities. Changes in meme concentration as a function of meme popularity are illustrated by plotting
relative (A) usage dominance, (B) adoption dominance, (C) usage entropy, and (D) adoption entropy. The relative dominance and entropy ratios are
averaged across hashtags in each popularity bin, with popularity defined as number of tweets T or adopters U; error bars indicate standard errors.
Gray areas represent the ranges of popularity in which actual data exhibit weaker concentration than both baseline models $M_3$ and $M_4$. The effect of
multiple social reinforcement is estimated by average exposures for every meme. The exposures can be measured in terms of (E) tweets or (F) users.
Similar results for different types of networks and community methods are described in SI.



Our method aims to discover viral memes. To label viral memes,
we rank all memes in our dataset based on numbers of tweets or
adopters, and define a percentile threshold. A threshold of $\theta_T$ or $\theta_U$
means that a meme is deemed viral if it is mentioned in more tweets
than $\theta_T$% of the memes, or adopted by more users than $\theta_U$% of the
memes, respectively. All the features are computed based on the first
50 tweets for each hashtag $h$. Two baselines are set up for comparison.
Random guess selects $n_{\mathrm{viral}}$ memes at random, where $n_{\mathrm{viral}}$ is the
number of viral memes in the actual data. Community-blind prediction
employs the same learning algorithm as ours but without the
community-based features. We compute both precision and recall
for evaluation; the former measures the proportion of predicted viral
memes that are actually viral in the real data, and the latter quantifies
how many of the viral memes are correctly predicted. Our community-based
prediction excels in both precision and recall, indicating
that communities are helpful in capturing viral memes (Fig. 5). For
example, when detecting the most viral memes by users ($\theta_U = 90$),
our method is about seven times as precise as random guess and over
three times as precise as prediction without community features. We
achieve a recall over 350% better than random guess and over 200%
better than community-blind prediction. Similar results are obtained
using different community detection methods or different types of
social network links (see SI).
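
The labeling rule and the two evaluation measures can be written down in a few lines of Python. The sketch below is illustrative; the function names are hypothetical, and the handling of ties (a meme is viral when at least the threshold percentage of memes are strictly less popular) is an assumption about a boundary case that the text leaves open.

```python
from bisect import bisect_left


def label_viral(popularity, theta):
    """Set of memes that are more popular than theta percent of all memes.

    popularity: dict meme -> number of tweets (or number of adopters).
    theta:      percentile threshold, e.g. 70, 80, or 90.
    """
    values = sorted(popularity.values())
    n = len(values)
    viral = set()
    for meme, pop in popularity.items():
        below = bisect_left(values, pop)   # memes strictly less popular
        if below * 100 >= theta * n:       # integer comparison avoids rounding
            viral.add(meme)
    return viral


def precision_recall(predicted, actual):
    """Precision and recall of a predicted set of viral memes."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Toy example: ten memes with popularity 1..10 and a threshold of 80.
popularity = {f"m{i}": i for i in range(1, 11)}
actual = label_viral(popularity, theta=80)
precision, recall = precision_recall({"m10", "m3"}, actual)
```

With a threshold of 80, the two most popular of the ten toy memes are labeled viral, so a prediction containing one of them plus one miss has precision and recall of one half.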



Discussion 

Despite the vast and growing literature on network communities,
the importance of community structure has not been fully explored
and understood. Our findings expose an important role of community
structure in the spreading of memes. While the role of
weak ties between different communities in information diffusion
has been recognized for decades 35,36 , we provide a direct approach
for translating data about community structure into predictive
knowledge about what information will spread virally. Our method
does not exploit message content, and can be easily applied to any
socio-technical network from a small sample of data. This result
can be relevant for online marketing and other social media applications.

Further analyses of network community structure in relation to 
social processes hold potential for characterizing and forecasting 
social behavior. We believe that many other complex dynamics of 
human society, from ethnic tension to global conflicts, and from 
grassroots social movements to political campaigns 17,44,45 , could be 
better understood by continued investigation of network structure. 

Methods 

We collected a 10% sample of all public tweets from Mar 24 to Apr 25, 2012 using the
Twitter streaming API (dev.twitter.com/docs/streaming-apis). Only tweets written in
English were extracted. The dataset comprises 121,807,378 tweets generated by
14,599,240 unique users, and containing at least one of 10,393,465 hashtags. We then
constructed an undirected, unweighted network based on reciprocal following relationships
between 595,460 randomly selected users, as bi-directional links reflect
more stable and reliable social connections. Such a conservative choice to exclude
information about direction and weights of links makes the approach more generally
applicable to cases where static data about the social network is more readily available
than dynamic data about information flow. Two other types of networks constructed
on the basis of retweets and mentions were also tested for robustness (see extended
analyses in SI).

Figure 4 | Evolution of two contrasting memes (viral vs. non-viral) in terms of community structure. We represent each community as a node,
whose size is proportional to the number of tweets produced by the community. The color of a community represents the time when the hashtag is first
used in the community. (A) The evolution of a viral meme (#ThoughtsDuringSchool) from the early stage (30 tweets) to the late stage (200 tweets) of
diffusion. (B) The evolution of a non-viral meme (#ProperBand) from the early stage to the final stage (65 tweets).



Figure 5 | Prediction performance. We predict whether a meme will go viral or not; a meme is labeled as viral if it produces more tweets (T) or is
adopted by more users (U) than a certain percentile threshold ($\theta$ = 70, 80, 90) of memes. We use the random forests classifier trained on community
concentration features, which are calculated based on the initial n = 50 tweets for each meme. Prediction results are robust across different networks
and community detection methods (see SI). We compute precision and recall to compare our community-based prediction against two baselines:
random guess and community-blind prediction.

We apply Infomap 31 , an established algorithm, to identify the community structure.
To ensure the robustness of our results, we perform the same analyses using another
widely used but very different community detection method, link clustering 32 . The
results are similar (see details in SI). The network remains unweighted for community
identification, to focus purely on the connection structure.

For quantifying meme concentration in communities and the strength of social 
reinforcement, we focus on new memes that emerged during our observation time 
window. New memes are defined as those with fewer than 20 tweets during the 
previous month (Feb 24 - Mar 23, 2012). A sensitivity test of our results with respect 
to hashtag filtering criteria is available in SI. 

To replicate the Twitter API sampling effect in the baseline models, each simulation
runs until 10 times more tweets are generated than the empirical numbers.
Then, we select 10% of the tweets at random. Every simulation is repeated 100 times
and the 10% sampling is repeated 10 times on each simulation outcome. Thus, the
average values of the measures from our toy models are computed across 100 × 10
samples.
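
The oversample-then-subsample procedure can be sketched in Python as follows. The function name and the `simulate` and `measure` callables are hypothetical placeholders for a baseline model and a concentration measure; keeping the sampled tweets in chronological order is an assumption, made so that measures restricted to the earliest tweets remain meaningful.

```python
import random


def baseline_average(simulate, measure, n_empirical,
                     n_runs=100, n_samples=10, rng=None):
    """Average a measure over repeated simulations and 10% subsamples.

    simulate(n_events, rng): hypothetical callable returning a chronological
                             list of n_events simulated tweets.
    measure(tweets):         hypothetical callable returning a number.
    n_empirical:             number of tweets observed for the meme.
    """
    rng = rng or random.Random()
    values = []
    for _ in range(n_runs):
        # Generate 10 times more tweets than observed in the data.
        events = simulate(10 * n_empirical, rng)
        for _ in range(n_samples):
            # Keep a random 10% of the tweets, preserving their order.
            keep = sorted(rng.sample(range(len(events)), len(events) // 10))
            values.append(measure([events[i] for i in keep]))
    return sum(values) / len(values)


# Toy usage: the "simulation" emits consecutive integers, the measure is the
# sample size, so the average must equal the empirical number of tweets.
calls = []


def count_tweets(tweets):
    calls.append(len(tweets))
    return len(tweets)


avg = baseline_average(lambda n, rng: list(range(n)), count_tweets,
                       n_empirical=5, n_runs=3, n_samples=2,
                       rng=random.Random(0))
```

With the default arguments, each average is taken over 100 × 10 subsamples, as described above.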

In prediction, we use the random forest algorithm, an ensemble classifier that 
constructs 500 decision trees 46 . Each decision tree is trained with 4 random features 
independently and the final prediction outcomes combine the outputs of all the trees. 
The good performance of the random forest model benefits from the assumption that 
an ensemble of "weak learners" can form a "strong learner." For training and testing, 
we employ 10-fold cross validation. 